bugfix(network): Reset network command id to keep commands in chronological order#2736
bugfix(network): Reset network command id to keep commands in chronological order#2736Caball009 wants to merge 6 commits into
Conversation
ee6bcd8 to
1f31061
Compare
|
| Filename | Overview |
|---|---|
| Core/GameEngine/Source/GameNetwork/NetworkUtil.cpp | Refactors the command-ID counter to file scope and adds GetCommandID()/ResetCommandID(); semantics change from pre- to post-increment is intentional and correct. |
| Core/GameEngine/Source/GameNetwork/Network.cpp | Adds reset in init() and a threshold-based reset in update() before command collection; guard condition correctly ensures reset only fires when entering a new execution frame. |
| Core/GameEngine/Include/GameNetwork/networkutil.h | Two new function declarations added cleanly; header already uses #pragma once. |
| Core/GameEngine/Source/GameNetwork/NetCommandList.cpp | Documentation-only change: adds a rationale comment explaining the known overflow limitation in addMessage() sorting. |
Sequence Diagram
sequenceDiagram
participant GL as GameLogic
participant NET as Network::update()
participant NU as NetworkUtil
participant CCL as GetCommandsFromCommandList()
participant NCL as NetCommandList
GL->>NET: frame advances
NET->>NU: GetCommandID()
alt "commandID >= 64000 AND new execution frame"
NET->>NU: ResetCommandID()
end
NET->>CCL: collect game messages
CCL->>NU: GenerateNextCommandID()
CCL->>NCL: addMessage(cmd)
NCL-->>NCL: sort by commandID
NCL-->>NET: ordered command list
Reviews (6): Last reviewed commit: "Changed implementation." | Re-trigger Greptile
1f31061 to
ebf1fa2
Compare
ebf1fa2 to
a7ff356
Compare
|
Resetting the command id just delays the potential mismatch, right? Wouldn't it be better to sort the command ids properly, even if that rarely causes mismatches with retail clients? |
It's reset every N frames, so it's not just delayed. |
a7ff356 to
3ae2622
Compare
xezon
left a comment
There was a problem hiding this comment.
The id needs to be unique per execution frame.
Can we just reset it to 0 or 1 at the beginning of every new frame? Or does that cause issues? What is the reason for it to reset every 128 frames now?
| static UnsignedShort commandID = 64000; | ||
| ++commandID; | ||
| return commandID; | ||
| static UnsignedShort commandID = 0; |
There was a problem hiding this comment.
Is it ok to start it at 0 when retail clients still start at 64000 ? I suspect this means every client can make up his own Id numbers as he pleases ?
Any clue why the original author picked 64000 as a start for it ?
There was a problem hiding this comment.
Is it ok to start it at 0 when retail clients still start at 64000 ?
Yes, this is ok. It's an important argument in favor of resetting the command ID as opposed to fixing the sorting implementation. The latter is not retail-compatible and relatively complicated.
I suspect this means every client can make up his own Id numbers as he pleases ?
Yes, each client can use their own value.
Any clue why the original author picked 64000 as a start for it ?
Not sure. Perhaps they were testing overflow behavior and forgot to change it back. If that's correct, I suppose the question becomes why didn't they notice the issues that come with overflows. They didn't have multiple instances so they'd be more constrained in their ability to stress test it, but I'm only speculating.
3ae2622 to
93c5ce7
Compare
|
When it's reset every 128 frames, isn't it possible that a packet from frame 127 comes in after it's already reset and then gets sorted wrong? |
|
My current understanding is that the ids only matter per frame, not across different frames. But then I do not understand why it needs to be reset every 128 frames instead of every frame. |
93c5ce7 to
1bccb3a
Compare
|
I noticed that at extremely high latency (beyond 64 frames runahead) the disconnect screen could be triggered by the reset somehow. I tried a ton of things, but I was unable to find out why that happens. That's why I made some changes to the implementation and only reset the command id when it gets close to overflowing.
This check should guard against that. If the check is inverted, you'll see wrong sorting once in a while.
If the runahead decreases the ids matter across multiple frames. |
xezon
left a comment
There was a problem hiding this comment.
I noticed that at extremely high latency (beyond 64 frames runahead) the disconnect screen could be triggered by the reset somehow. I tried a ton of things, but I was unable to find out why that happens. That's why I made some changes to the implementation and only reset the command id when it gets close to overflowing.
Ok so this means every ~16 minutes there is a miniscule chance of network breakdown? It would be good to find a definite fix for it so we never need to worry about this again.
If the runahead decreases the ids matter across multiple frames.
This is a big problem for resetting. Can you gather more intel about this? Can we avoid it?
Maybe a good command reset opportunity is when there is a time frame where little to no new network commands were sent to peers?
| if (frame + m_runAhead > m_lastExecutionFrame) { | ||
| // TheSuperHackers @bugfix Caball009 06/06/2026 Reset the network command id to prevent it from overflowing. | ||
| // This prevents commands from being sorted incorrectly, which can cause spurious mismatches at low CRC intervals. | ||
| if (GetCommandID() >= 64000) { |
There was a problem hiding this comment.
Check this before the previous condition, because this will be the earlier out.
|
Also inquired with Chad Gippy: If you have a monotonically increasing if (a > b)fails across the wrap boundary. A common solution is to compare the difference modulo 2¹⁶ and interpret it as a signed value: bool is_newer(uint16_t a, uint16_t b)
{
return static_cast<int16_t>(a - b) > 0;
}Examples: is_newer(100, 50); // true
is_newer(50, 100); // false
is_newer(0, 65535); // true (wrapped by 1)
is_newer(65535, 0); // falseThe subtraction is performed modulo 65536: 0 - 65535 == 1which becomes LimitationThis only works if the counters are never more than half the range apart. For a 16-bit counter: The value exactly half a range away ( a = 0;
b = 32768;You cannot tell whether A robust version is: bool is_newer(uint16_t a, uint16_t b)
{
uint16_t diff = a - b;
return diff != 0 && diff < 0x8000;
}This is essentially the same technique used for sequence numbers in networking protocols such as TCP, where wraparound is expected. |
|
So what you could do for the non retail compatible sorting: bool isCommandIdNewer(uint16_t newVal, uint16_t oldVal)
{
#if RETAIL_COMPATIBLE_NETWORKING
return newVal > oldVal;
#else
uint16_t diff = newVal - oldVal;
return diff != 0 && diff < 0x8000;
#endif
}This would require no manual ID resetting. Needs testing. |
Synchronized network commands are given a unique id so that they can be sorted chronologically. The id needs to be unique per execution frame. It's a uint16 value so it'll overflow quite fast, especially because it's never reset and the starting value is 64000 by default.
When the network runahead decreases it's expected that multiple frames are executed right after each other. I noticed an issue with the overflow when I set the CRC interval to once per frame. When the overflow happened for a client combined with a decrease of the runahead, the order of the CRC messages was no longer guaranteed to be in chronological order for that execution frame. That would cause a mismatch if the overflow didn't happen for all clients at the same time.
With a starting value of 64000 and a CRC message every frame the first overflow happens after ~25 seconds. You can see it mismatch here because of the overflow (this was with a special test build);
https://www.youtube.com/live/V_l-q9Y-DmA?t=621s
https://www.youtube.com/live/V_l-q9Y-DmA?t=9499s
I've added some test code in the second commit. If that's used without this fix, it becomes very apparent that the messages are not in the right order when the command id overflows. ~On top of that, it's very likely that the disconnect bug (disconnect screen with everyone at
59XXX msec) occurs frequently.I intend to do a follow up PR to make the
static commandIDvariable a part of theNetCommandMsgclass.